Project Description:

A project from the medical domain. The dataset, created by Max Little of the University of Oxford in collaboration with the National Centre for Voice and Speech, Denver, Colorado, is composed of a range of biomedical voice measurements from 31 people, 23 with Parkinson's disease.

Context:

Parkinson’s Disease (PD) is a degenerative neurological disorder marked by decreased dopamine levels in the brain. It manifests itself through a deterioration of movement, including the presence of tremors and stiffness. There is commonly a marked effect on speech, including dysarthria (difficulty articulating sounds), hypophonia (lowered volume), and monotone (reduced pitch range). Additionally, cognitive impairments and changes in mood can occur, and risk of dementia is increased. Traditional diagnosis of Parkinson’s Disease involves a clinician taking a neurological history of the patient and observing motor skills in various situations. Since there is no definitive laboratory test to diagnose PD, diagnosis is often difficult, particularly in the early stages when motor effects are not yet severe. Monitoring progression of the disease over time requires repeated clinic visits by the patient. An effective screening process, particularly one that doesn’t require a clinic visit, would be beneficial. Since PD patients exhibit characteristic vocal features, voice recordings are a useful and non-invasive tool for diagnosis. If machine learning algorithms could be applied to a voice recording dataset to accurately diagnose PD, this would be an effective screening step prior to an appointment with a clinician.

The dataset is extracted from the paper: 'Exploiting Nonlinear Recurrence and Fractal Scaling Properties for Voice Disorder Detection', Little MA, McSharry PE, Roberts SJ, Costello DAE, Moroz IM. BioMedical Engineering OnLine 2007, 6:23 (23 June 2007)

Data Description:

This dataset is composed of a range of biomedical voice measurements from 31 people, 23 with Parkinson's disease (PD). Each column in the table is a particular voice measure, and each row corresponds to one of 195 voice recordings from these individuals ("name" column). The main aim of the data is to discriminate healthy people from those with PD, according to the "status" column, which is set to 0 for healthy and 1 for PD.

The data is in ASCII CSV format. Each row of the CSV file contains an instance corresponding to one voice recording. There are around six recordings per patient; the patient is identified in the first column.

The columns are as follows:

name - ASCII subject name and recording number

MDVP:Fo(Hz) - Average vocal fundamental frequency

MDVP:Fhi(Hz) - Maximum vocal fundamental frequency

MDVP:Flo(Hz) - Minimum vocal fundamental frequency

MDVP:Jitter(%), MDVP:Jitter(Abs), MDVP:RAP, MDVP:PPQ, Jitter:DDP - Several measures of variation in fundamental frequency

MDVP:Shimmer, MDVP:Shimmer(dB), Shimmer:APQ3, Shimmer:APQ5, MDVP:APQ, Shimmer:DDA - Several measures of variation in amplitude

NHR, HNR - Two measures of ratio of noise to tonal components in the voice

status - Health status of the subject: one = Parkinson's, zero = healthy

RPDE, D2 - Two nonlinear dynamical complexity measures

DFA - Signal fractal scaling exponent

spread1, spread2, PPE - Three nonlinear measures of fundamental frequency variation.

Objective:

The goal is to classify the patients into the respective labels using the attributes from their voice recordings.

Import necessary libraries

1. Load the dataset

2. Eye-balling the raw data to get a feel for it in terms of the number of records, structure of the file, number of attributes, types of attributes, and a general idea of likely challenges in the dataset.

One of the biggest challenges in this dataset, in my view, is understanding clearly what each attribute means. The attributes are heavily laden with medical terminology, which makes it quite difficult to interpret them without sufficient domain knowledge.
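A minimal sketch of the loading and eye-balling steps. The UCI file name 'parkinsons.data' is an assumption (substitute your local path), and a tiny inline sample stands in for the real file so the snippet runs anywhere:

```python
import io
import pandas as pd

# Stand-in for the real file; on the real data use:
#   parkinson_data = pd.read_csv('parkinsons.data')   # path is an assumption
sample = io.StringIO(
    "name,MDVP:Fo(Hz),MDVP:Fhi(Hz),MDVP:Flo(Hz),status\n"
    "phon_R01_S01_1,119.992,157.302,74.997,1\n"
    "phon_R01_S01_2,122.400,148.650,113.819,1\n"
    "phon_R01_S07_1,197.076,206.896,192.055,0\n"
)
parkinson_data = pd.read_csv(sample)

print(parkinson_data.shape)            # (rows, columns)
print(parkinson_data.dtypes)           # 'name' is object, the rest numeric
print(parkinson_data.isnull().sum())   # count of missing values per column
```

On the real file, `shape` reports (195, 24), `dtypes` shows every column numeric except name, and the null counts are all zero.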

Shape of the data

The two-dimensional dataframe, parkinson_data, consists of 195 rows and 24 columns.

Data type of each attribute

All the attributes apart from name contain numerical values.

Checking for the presence of missing values

None of the columns have null values.

Finding the unique values in each attribute

As noted in the Data Description section above, apart from the attribute status, every other attribute has continuous values.

As the attribute name is not useful for this analysis, we can set it as the index.
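A brief sketch of the two steps above on a toy frame (the same calls apply to parkinson_data):

```python
import pandas as pd

# Toy frame standing in for parkinson_data.
df = pd.DataFrame({
    "name": ["phon_R01_S01_1", "phon_R01_S01_2", "phon_R01_S07_1"],
    "MDVP:Fo(Hz)": [119.992, 122.400, 197.076],
    "status": [1, 1, 0],
})

print(df.nunique())          # 'status' has only 2 unique values; the rest are continuous
df = df.set_index("name")    # 'name' no longer appears among the feature columns
print(df.columns.tolist())
```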

3. Using univariate & bivariate analysis to check the individual attributes for their basic statistics such as central values, spread, tails, relationships between variables, etc.

5 point summary of numerical attributes
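Illustrative only: pandas' describe() produces the count, mean, standard deviation, minimum, quartiles and maximum quoted in the per-attribute summaries below (shown here on a synthetic column standing in for the real data):

```python
import numpy as np
import pandas as pd

# Synthetic column with roughly the same scale as MDVP:Fo(Hz).
rng = np.random.default_rng(0)
df = pd.DataFrame({"MDVP:Fo(Hz)": rng.normal(154.0, 41.0, 195)})

summary = df.describe()
print(summary)   # index: count, mean, std, min, 25%, 50%, 75%, max
```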

The numerical attributes are summarised in the following manner:

i. MDVP:Fo(Hz): There are 195 records with a mean value of 154.23 Hz. The minimum and maximum frequency recorded by the individuals are 88.33 Hz and 260.11 Hz respectively. 25% of people have an average vocal fundamental frequency under 117.57 Hz, 50% of people have an average vocal fundamental frequency under 148.79 Hz whereas 75% of people have an average vocal fundamental frequency under 182.77 Hz. Also, the observations differ from the mean value by 41.39 Hz

ii. MDVP:Fhi(Hz): There are 195 records with a mean value of 197.10 Hz. The greatest minimum and maximum frequency recorded over period-to-period by the individuals are 102.15 Hz and 592.03 Hz respectively. For 25% of the observed people the value is under 134.86 Hz, for 50% of people it is under 175.83 Hz whereas for 75% of people it is under 224.21 Hz. Also, the observations differ from the mean value by 91.49 Hz

iii. MDVP:Flo(Hz): There are 195 records with a mean value of 116.32 Hz. The lowest minimum and maximum frequency recorded over period-to-period by the individuals are 65.48 Hz and 239.17 Hz respectively. For 25% of the observed people the value is under 84.29 Hz, for 50% of people it is under 104.32 Hz whereas for 75% of people it is under 140.02 Hz. Also, the observations differ from the mean value by 43.52 Hz

iv. MDVP:Jitter(%): There are 195 records with a mean value of 0.0062%. The minimum and maximum value recorded for the observed individuals are 0.00168% and 0.03316% respectively. For 25% of the observed people the value is under 0.00346%, for 50% of people it is under 0.00494% whereas for 75% of people it is under 0.007365%. Also, the observations differ from the mean value by 0.0049%

v. MDVP:Jitter(Abs): There are 195 records with a mean value of 0.000044. The minimum and maximum variability of the pitch within the analyzed voice sample for the observed individuals are 0.000007 and 0.00026 respectively. For 25% of the observed people the value is under 0.00002, for 50% of people it is under 0.00003 whereas for 75% of people it is under 0.00006. Also, the observations differ from the mean value by 0.000035.

vi. MDVP:RAP: There are 195 records with a mean value of 0.003306. The minimum and maximum variability of the pitch within the analyzed voice sample with a smoothing factor (of 3 periods) for the observed individuals are 0.00068 and 0.021440 respectively. For 25% of the observed people the value is under 0.00166, for 50% of people it is under 0.0025 whereas for 75% of people it is under 0.003835. Also, the observations differ from the mean value by 0.002968.

vii. MDVP:PPQ: There are 195 records with a mean value of 0.003446. The minimum and maximum variability of the pitch within the analyzed voice sample with a smoothing factor (of 5 periods) for the observed individuals are 0.00092 and 0.01958 respectively. For 25% of the observed people the value is under 0.00186, for 50% of people it is under 0.00269 whereas for 75% of people it is under 0.003955. Also, the observations differ from the mean value by 0.002759.

viii. Jitter:DDP: There are 195 records with a mean value of 0.00992. For the observed persons it ranges from 0.00204 to 0.06433. For 25% of the observed people the value is under 0.004985, for 50% of people it is under 0.00749 whereas for 75% of people it is under 0.011505. Also, the observations differ from the mean value by 0.008903

ix. MDVP:Shimmer: There are 195 records with a mean value of 0.0297. For the observed persons it ranges from 0.00954 to 0.11908. For 25% of the observed people the value is under 0.016505, for 50% of people it is under 0.02297 whereas for 75% of people it is under 0.037885. Also, the observations differ from the mean value by 0.018857

x. MDVP:Shimmer(dB): There are 195 records with a mean value of 0.28 dB. For the observed persons it ranges from 0.085 dB to 1.302 dB. For 25% of the observed people the value is under 0.149 dB, for 50% of people it is under 0.221 dB whereas for 75% of people it is under 0.35 dB. Also, the observations differ from the mean value by 0.195 dB

xi. Shimmer:APQ3: There are 195 records with a mean value of 0.015664. For the observed persons it ranges from 0.00455 to 0.05647. For 25% of the observed people the value is under 0.008245, for 50% of people it is under 0.01279 whereas for 75% of people it is under 0.020265. Also, the observations differ from the mean value by 0.010153

xii. Shimmer:APQ5: There are 195 records with a mean value of 0.017878. For the observed persons it ranges from 0.0057 to 0.0794. For 25% of the observed people the value is under 0.00958, for 50% of people it is under 0.01347 whereas for 75% of people it is under 0.02238. Also, the observations differ from the mean value by 0.012024

xiii. MDVP:APQ: There are 195 records with a mean value of 0.024081. For the observed persons it ranges from 0.00719 to 0.13778. For 25% of the observed people the value is under 0.01308, for 50% of people it is under 0.01826 whereas for 75% of people it is under 0.0294. Also, the observations differ from the mean value by 0.016947

xiv. Shimmer:DDA: There are 195 records with a mean value of 0.046993. For the observed persons it ranges from 0.01364 to 0.16942. For 25% of the observed people the value is under 0.024735, for 50% of people it is under 0.03836 whereas for 75% of people it is under 0.060795. Also, the observations differ from the mean value by 0.030459

xv. NHR: There are 195 records with a mean value of 0.024847. For the observed persons it ranges from 0.00065 to 0.31482. For 25% of the observed people the value is under 0.00065, for 50% of people it is under 0.005925 whereas for 75% of people it is under 0.01166. Also, the observations differ from the mean value by 0.040418

xvi. HNR: There are 195 records with a mean value of 21.885974. For the observed persons it ranges from 8.441 to 33.047. For 25% of the observed people the value is under 19.198, for 50% of people it is under 22.085 whereas for 75% of people it is under 25.0755. Also, the observations differ from the mean value by 4.425764

xvii. status: It is clear that the majority of the observed individuals do have Parkinson's disease.

xviii. RPDE: There are 195 records with a mean value of 0.498536. For the observed persons it ranges from 0.25657 to 0.685151. For 25% of the observed people the value is under 0.421306, for 50% of people it is under 0.495954 whereas for 75% of people it is under 0.587562. Also, the observations differ from the mean value by 0.103942

xix. DFA: There are 195 records with a mean value of 0.718099. For the observed persons it ranges from 0.574282 to 0.825288. For 25% of the observed people the value is under 0.674758, for 50% of people it is under 0.722254 whereas for 75% of people it is under 0.761881. Also, the observations differ from the mean value by 0.055336

xx. spread1: There are 195 records with a mean value of -5.684397. For the observed persons it ranges from -7.964984 to -2.434031. For 25% of the observed people the value is under -6.450096, for 50% of people it is under -5.720868 whereas for 75% of people it is under -5.046192. Also, the observations differ from the mean value by 1.090208.

xxi. spread2: There are 195 records with a mean value of 0.22651. For the observed persons it ranges from 0.006274 to 0.450493. For 25% of the observed people the value is under 0.174351, for 50% of people it is under 0.218885 whereas for 75% of people it is under 0.279234. Also, the observations differ from the mean value by 0.083406.

xxii. D2: There are 195 records with a mean value of 2.381826. For the observed persons it ranges from 1.423287 to 3.671155. For 25% of the observed people the value is under 2.099125, for 50% of people it is under 2.361532 whereas for 75% of people it is under 2.636456. Also, the observations differ from the mean value by 0.382799

xxiii. PPE: There are 195 records with a mean value of 0.206552. For the observed persons it ranges from 0.044539 to 0.527367. For 25% of the observed people the value is under 0.137451, for 50% of people it is under 0.194052 whereas for 75% of people it is under 0.25298. Also, the observations differ from the mean value by 0.090119.

Univariate Analysis:

From the above plot it seems that the curve is slightly positively skewed.

The slight positive skew is confirmed by the skewness value.

From the above plot it is clear that the attribute 'MDVP:Fo(Hz)' doesn't have any outliers.

From the plot it is clear that there are individuals with extreme values of maximum vocal fundamental frequency.

The curve is highly positively skewed.

From the above plot it is clear that 'MDVP:Fhi(Hz)' does have outliers. The number of outliers can be calculated as follows:

Thus, there are 11 values in 'MDVP:Fhi(Hz)' which are extreme as compared to other observations in the same attribute.
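The outlier counts quoted throughout this section can be computed with Tukey's 1.5 × IQR rule; a minimal sketch (the helper name is illustrative):

```python
import pandas as pd

def count_iqr_outliers(s: pd.Series) -> int:
    """Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] (Tukey's rule)."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return int(((s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)).sum())

# Toy series: the value 500.0 lies far above the rest.
s = pd.Series([100.0, 110.0, 105.0, 98.0, 102.0, 500.0])
print(count_iqr_outliers(s))  # 1
```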

From the graph it is clear that the curve is positively skewed.

The positive skew is confirmed here.

As seen from the above plot, there are some outliers in 'MDVP:Flo(Hz)'. The number of outliers can be calculated as:

Thus, there are 9 values in 'MDVP:Flo(Hz)' which are extreme as compared to other observations in the same attribute.

The curve is a highly skewed one, thereby suggesting the presence of outliers.

The curve is highly positively skewed.

As seen from the above plot, 'MDVP:Jitter(%)' does have outliers. The number of outliers can be calculated as:

Thus, there are 14 values in 'MDVP:Jitter(%)' which are extreme as compared to other observations in the same attribute.

The curve is a highly skewed one, thereby suggesting the presence of outliers.

The curve is highly positively skewed.

From the above plot it is clear that outliers are present in 'MDVP:Jitter(Abs)'. The number of outliers can be calculated as:

Thus, there are 6 values in 'MDVP:Jitter(Abs)' which are extreme as compared to other observations in the same attribute.

The curve is a highly skewed one, thereby suggesting the presence of outliers.

The curve is highly positively skewed.

From the above plot it is evident that 'MDVP:RAP' does have outliers. The number of outliers can be calculated as:

Thus, there are 14 values in 'MDVP:RAP' which are extreme as compared to other observations in the same attribute.

The curve is a highly skewed one, thereby suggesting the presence of outliers.

The curve is highly positively skewed.

From the above plot it is clear that 'MDVP:PPQ' does have outliers. The number of outliers can be calculated as:

Thus, the attribute 'MDVP:PPQ' has 15 extreme values.

The curve is a highly skewed one, thereby suggesting the presence of outliers.

The curve is highly positively skewed.

From the above plot it is clear that 'Jitter:DDP' has outliers. The number of outliers can be calculated as:

Thus, 'Jitter:DDP' has 14 values that are extreme compared to its other values.

From the graph it is clear that it is positively skewed.

The positive skew is confirmed here.

From the above plot it is clear that the attribute 'MDVP:Shimmer' has outliers. The number of outliers can be calculated as:

Thus, there are 8 values in 'MDVP:Shimmer' that are flagged as extremes.

The curve is positively skewed.

The positive skew is confirmed here.

'MDVP:Shimmer(dB)' has outliers. The number of outliers can be calculated as:

Thus, 10 values in 'MDVP:Shimmer(dB)' are extreme.

From the above plot it is clear that the attribute does have outliers.

The curve is positively skewed.

The number of outliers in 'Shimmer:APQ3' can be calculated as:

Thus, there are 6 values in 'Shimmer:APQ3' which are considered outliers.

The curve is positively skewed, with outliers present.

The curve is positively skewed here.

The number of outliers in 'Shimmer:APQ5' is calculated as:

Thus, there are 13 extreme values in 'Shimmer:APQ5'.

The curve is highly positively skewed, with outliers present.

The curve is highly positively skewed.

From the above plot it is clear that there are outliers in 'MDVP:APQ', whose count can be calculated as:

Thus, 12 values in 'MDVP:APQ' are considered extremes.

The curve is positively skewed.

The positive skew is confirmed here.

The number of outliers in 'Shimmer:DDA' can be calculated as:

Thus, there are 6 extreme values in 'Shimmer:DDA'.

The curve is highly positively skewed, with outliers present.

The curve is highly positively skewed.

The number of outliers in 'NHR' can be calculated as:

Thus, there are 19 values in 'NHR' that are considered outliers.

The curve is slightly negatively skewed.

The negative skew is confirmed here.

From the above plot it is clear that 'HNR' does have outliers. The number of outliers can be calculated as:

Thus, there are only 3 extreme values in 'HNR'.

From the plot it seems that the attribute is almost normally distributed.

The curve is slightly negatively skewed.

There are no outliers in 'RPDE'.

The attribute is almost normally distributed.

The curve is slightly negatively skewed.

From the plot it is clear that 'DFA' doesn't have any outliers.

From the above plot it seems that the attribute is almost normally distributed.

The curve is slightly positively skewed.

From the above plot it is clear that the attribute 'spread1' does have outliers. The number of outliers can be calculated as:

Thus, there are only 4 extreme values in 'spread1'.

The attribute is almost normally distributed.

The curve is slightly positively skewed.

From the plot it is clear that 'spread2' has outliers. The number of outliers can be calculated as:

Thus, in 'spread2' only 2 values are extreme as compared to its other values.

The attribute is almost normally distributed.

The curve is slightly positively skewed.

From the above plot it is clear that 'D2' has outliers. The number of outliers can be calculated as:

Thus, there is only one outlier in 'D2'.

The curve is slightly positively skewed.

The positive skew is confirmed here.

From the above plot it is clear that 'PPE' has outliers. The number of outliers can be calculated as:

Thus, there are only 5 extreme values in 'PPE'.

'status' is the target, or dependent, variable here. From the above plot it is clear that the number of persons having Parkinson's disease is much higher than the number not having the disease; the ratio of healthy to PD is almost 1:3. So we can expect the model to have a much better chance of predicting status = 1 than status = 0.
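The class balance can be checked with value_counts(); the toy series below reproduces the dataset's actual 48 healthy / 147 PD split:

```python
import pandas as pd

# Stand-in 'status' column with the dataset's 48:147 class balance.
status = pd.Series([0] * 48 + [1] * 147, name="status")

print(status.value_counts())                      # 1: 147, 0: 48 -> roughly 1:3
print(status.value_counts(normalize=True).round(2))  # class proportions
```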

Bivariate Analysis

Here we will visualize how the different independent attributes vary with respect to the dependent attribute, 'status'.

Multivariate Analysis

This plot, along with the correlation matrix and heatmap, will help us analyze the relationships between the different attributes.

Thus, from the above three we can see that the pairs MDVP:Jitter(%) and MDVP:RAP, MDVP:Jitter(%) and Jitter:DDP, MDVP:Shimmer and MDVP:Shimmer(dB), MDVP:Shimmer and Shimmer:APQ3, and MDVP:Shimmer and Shimmer:DDA all have a correlation value of 0.99.
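A hedged sketch of how such highly correlated pairs can be listed programmatically (the helper name and threshold are illustrative; a toy frame stands in for the real data):

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df: pd.DataFrame, threshold: float = 0.95):
    """Return column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair appears once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [(a, b, round(upper.loc[a, b], 2))
            for a in upper.index for b in upper.columns
            if pd.notna(upper.loc[a, b]) and upper.loc[a, b] > threshold]

# Toy frame: 'b' is a near-copy of 'a', 'c' is independent noise.
rng = np.random.default_rng(1)
a = rng.normal(size=200)
df = pd.DataFrame({"a": a,
                   "b": a + rng.normal(scale=0.01, size=200),
                   "c": rng.normal(size=200)})
print(high_corr_pairs(df))   # e.g. [('a', 'b', 1.0)]
```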

4. Preparation of data for models and Splitting them into test and training dataset

As MDVP:Jitter(%) has a correlation of 0.99 with MDVP:RAP and Jitter:DDP, and similarly MDVP:Shimmer has a correlation of 0.99 with MDVP:Shimmer(dB), Shimmer:APQ3 and Shimmer:DDA, we can drop the attributes MDVP:Jitter(%) and MDVP:Shimmer.

Splitting the data into training and test sets in the ratio 70:30

Here, the independent variables are denoted by 'X' and the target is represented by 'y'.

We will also standardize the dataset:
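A sketch of the split-and-standardize step, with synthetic arrays standing in for the prepared dataframe; note the scaler is fitted on the training split only, to avoid leakage into the test set:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins; on the real data X would be the feature columns
# (after dropping the two correlated attributes) and y would be df['status'].
rng = np.random.default_rng(0)
X = rng.normal(size=(195, 20))
y = rng.integers(0, 2, size=195)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)

scaler = StandardScaler().fit(X_train)   # fit on the training split only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
print(X_train_s.shape, X_test_s.shape)   # (136, 20) (59, 20)
```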

5. Train at least 3 standard classification algorithms and note down their accuracies on the test data.

Logistic Regression

The accuracy obtained with the standardized data set is 86.44%.

Thus, the Logistic Regression classifier predicted 11 actually positive records correctly but misclassified 8 positive records as negative. The model did not predict any actual negative value as positive, and it successfully predicted all 40 negative records.
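The fit-and-evaluate pattern used for each classifier can be sketched as follows; make_classification stands in for the standardized voice features and the 59-record test set:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with the same shape as the real problem.
X, y = make_classification(n_samples=195, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=1)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = clf.predict(X_te)
acc = accuracy_score(y_te, pred)
cm = confusion_matrix(y_te, pred)   # rows: actual class, columns: predicted class
print(round(acc, 4))
print(cm)
```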

K-NN

So, here we will consider the value of k = 1.
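One common way to sanity-check the choice of k is to score a range of odd values with cross-validation; an illustrative sketch on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the standardized voice features.
X, y = make_classification(n_samples=195, n_features=20, random_state=0)

# Mean 5-fold CV accuracy for each odd k from 1 to 11.
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in range(1, 12, 2)}
best_k = max(scores, key=scores.get)
print({k: round(v, 3) for k, v in scores.items()})
print("best k:", best_k)
```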

Thus, the accuracy obtained from the model based on KNN is 91.52%.

From the above confusion matrix it is clear that the model correctly predicted 16 of the 19 actually Positive records (3 were misclassified as Negative) and correctly identified 38 of the 40 Negative records (2 were misclassified as Positive).

Naive Bayes

Thus, the accuracy obtained from the model based on Naive-Bayes classifier is 67.8%.

From the above confusion matrix it is clear that the model correctly predicted only 13 of the 19 actually Positive records (6 were misclassified as Negative) and correctly identified 27 of the 40 Negative records (13 were misclassified as Positive).

Support Vector Machine (SVM)

Thus, the accuracy obtained from the model based on SVM is 93.22%.

From the above confusion matrix it is clear that the model correctly predicted 15 of the 19 actually Positive records (4 were misclassified as Negative) and correctly identified all 40 Negative records; it did not predict any actually Negative record as Positive.

6. Train a Meta Classifier and note the accuracy on test data.

Here a KNeighborsClassifier, a Support Vector Classifier (SVC) and a Naive Bayes classifier (GaussianNB) will be individually trained. The performance of each classifier will be measured using the accuracy score. Finally, we will stack the predictions of these classifiers using the StackingCVClassifier object, with a Logistic Regression classifier as the meta classifier, and compare the results.
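A sketch of the stacking step. The notebook uses mlxtend's StackingCVClassifier; scikit-learn's StackingClassifier, shown here on synthetic stand-in data, is an equivalent alternative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the standardized voice features.
X, y = make_classification(n_samples=195, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=1)

stack = StackingClassifier(
    estimators=[("knn", KNeighborsClassifier()),
                ("svc", SVC()),
                ("nb", GaussianNB())],
    final_estimator=LogisticRegression(),   # meta classifier
    cv=5,                                   # out-of-fold predictions feed the meta model
)
stack.fit(X_tr, y_tr)
acc = stack.score(X_te, y_te)
print(round(acc, 4))
```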

Thus, even though we stacked the models, the overall test accuracy of the StackingClassifier (91.5%) is lower than that obtained by the Support Vector Machine classifier.

From the above confusion matrix it is clear that the model correctly predicted 14 of the 19 actually Positive records (5 were misclassified as Negative) and correctly identified all 40 Negative records; no Negative record was predicted as Positive.

7. Train at least one ensemble model and note the accuracy.

Decision Tree

Thus, the accuracy obtained when the DecisionTree classifier is used with criterion = 'gini' is 89.83%.

From the above confusion matrix it is clear that the model correctly predicted 14 of the 19 actually Positive records (5 were misclassified as Negative) and correctly identified 39 of the 40 Negative records (1 was misclassified as Positive).

Regularization of Decision Tree

Without pruning, the Decision Tree obtained a test accuracy of 89.83%. Now let us try to prune the tree by changing the value of the 'max_depth' argument:
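An illustrative pruning sweep (synthetic data stands in for the real features): record the test accuracy for a range of max_depth values and look for the plateau:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with the same shape as the real problem.
X, y = make_classification(n_samples=195, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=1)

accs = {}
for depth in range(1, 9):
    tree = DecisionTreeClassifier(criterion="gini", max_depth=depth, random_state=1)
    accs[depth] = tree.fit(X_tr, y_tr).score(X_te, y_te)
    print(depth, round(accs[depth], 4))
```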

Here, we can see that after max_depth = 5 the accuracy plateaus. So, let us model our Decision Tree with max_depth = 5.

Thus, we can see that even after pruning the model, the accuracy remains the same.

Bagging

Thus, the accuracy obtained in this case is 83.05%.

Regularization of Bagging Model

From the above we can see that the accuracy reaches a plateau after 50 base estimators. So,
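A sketch of growing the bagging ensemble from 10 to 50 base estimators (synthetic stand-in data; the default base estimator is a decision tree):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with the same shape as the real problem.
X, y = make_classification(n_samples=195, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=1)

bag_accs = {}
for n in (10, 50):
    bag = BaggingClassifier(n_estimators=n, random_state=1).fit(X_tr, y_tr)
    bag_accs[n] = bag.score(X_te, y_te)
    print(n, round(bag_accs[n], 4))
```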

Thus, on increasing the size of the ensemble from 10 to 50, there has been a considerable increase in the accuracy of the model. The improved accuracy stands at 88.14%.

From the above confusion matrix it is clear that the model correctly predicted 14 of the 19 actually Positive records (5 were misclassified as Negative) and correctly identified 39 of the 40 Negative records (1 was misclassified as Positive).

Compare all the models and pick the best one among them

Before deciding which model (Logistic Regression, KNN, Naive Bayes, Support Vector Machine, StackingClassifier or BaggingClassifier) is best, let's summarise the results from each. The test data contained 19 Positive records (individuals not affected by Parkinson's) and 40 Negative records (individuals affected by Parkinson's disease); the six models differed in their numbers of True Positives, True Negatives, False Positives and False Negatives.

Logistic Regression:

This algorithm provided an accuracy of 86.44%. Of the 19 records in the test data not affected by Parkinson's, it correctly predicted 11. It correctly predicted all 40 records in the test data affected by Parkinson's.

KNN (K-Nearest Neighbor):

This algorithm provided an accuracy of 91.52%. Of the 19 records not affected by Parkinson's, it correctly predicted 16. Of the 40 records affected by Parkinson's, it correctly predicted 38, missing 2 records where the person was actually suffering from Parkinson's.

Naive Bayes:

This algorithm provided an accuracy of 67.8%. Of the 19 records not affected by Parkinson's, it correctly predicted only 13. Of the 40 records affected by Parkinson's, it correctly predicted only 27, missing 13 records where the person was actually suffering from Parkinson's.

Support Vector Machine (SVC):

This algorithm provided an accuracy of 93.22%. Of the 19 records not affected by Parkinson's, it correctly predicted 15. It correctly predicted all 40 records affected by Parkinson's.

StackingClassifier:

This algorithm provided an accuracy of 91.5%. Of the 19 records not affected by Parkinson's, it correctly predicted 14. It correctly predicted all 40 records affected by Parkinson's.

Bagging Classifier:

This algorithm provided an accuracy of 88.14%. Of the 19 records not affected by Parkinson's, it correctly predicted 14. Of the 40 records affected by Parkinson's, it correctly predicted 39, missing just 1 record where the person was actually suffering from Parkinson's.

Thus, we can see that the Support Vector Machine (SVC) has the best accuracy among the six algorithms used here. It also correctly identified the most persons suffering from Parkinson's (along with the StackingClassifier and Logistic Regression).

Thus, in this case we can say that the Support Vector Machine (SVC) is the best model of the six.